Abstract


Contemporary deep neural networks exhibit impressive results on practical
problems. These networks generalize well although their inherent capacity may
extend significantly beyond the number of training examples. We analyze this
behavior in the context of deep, infinite neural networks. We show that deep infinite
layers are naturally aligned with Gaussian processes and kernel methods, and
devise stochastic kernels that encode the information of these networks. We show
that stability results apply despite the size, offering an explanation for their
empirical success.

1 Introduction 

Deep neural networks have become widely adopted for tasks ranging from image
labeling in computer vision to parsing and machine translation in natural language
processing. The networks in these tasks usually consist of an input layer, several
semi-structured intermediate layers and an output layer. Surprisingly, although these
models are large and complex, they appear easier to learn at scale, delivering
state-of-the-art performance (e.g., DU) with increasing amounts of data and computation.
This setting poses new questions for learning since the number of parameters in these
models, mostly residing in the deep layers, may be substantially larger than what could
be supported by the training examples. Many expect such networks to overfit while in
practice they (often) do not, and their decision boundaries appear smooth. Our work
suggests an explanation for this behavior based on deep and infinitely wide networks
where the number of parameters is uncountably infinite.

Neural networks with a single infinite intermediate layer have been considered by
various works. Hornik [9] shows these networks are universal approximators and
[16, 24, 5] explore their properties in the context of Gaussian processes and kernel
methods. Unfortunately, since these networks interact linearly with the input layer,
they are limited





Figure 1: Left and middle images present finite neural networks with one and two intermediate
layers, respectively. The right image depicts a neural network with two infinitely wide intermediate
layers: one is indexed by $w$ and consists of the functions $\phi_x(w)$, and the other is indexed
by $u$ and consists of the functions $\psi_x(u)$. The functions $\phi_x(w)$ are associated with a
Gaussian measure over $w$ and the functions $\psi_x(u)$ are associated with a Gaussian process over $u$.


in their representation power. Moreover, the recent success of neural networks seems
to rely on deep architectures, while current infinite networks only encode the
information of a single layer. Lastly, these works do not explain why learning the
likelihood of infinite networks does not overfit, or why the decision boundary of the
learned network is simple.

In our work we extend the framework of kernel methods for infinite networks to
multiple layers. We introduce stochastic kernels that are derived from Gaussian
processes and encode the information of two infinite layers. We also provide a
generalization bound for these networks, based on stability of regularized loss
minimization, and attribute the simplicity of the learned deep infinite network to the
fast convergence of algorithms on our learning framework.

We begin by introducing infinite neural networks with a single intermediate layer. 
We relate their learning units to integrals over functions in the Euclidean space with 
respect to the Gaussian distribution, as well as describe their connections to kernel 
functions. We subsequently construct the second layer and relate its learning units to 
expectations with respect to a Gaussian process. These expectations form stochastic 
kernel functions that encode the multilayer and infinitely wide neural network. Finally, 
we analyze the generalization properties of these networks and introduce a method 
to incorporate localities and non-linearities such as those arising from convolutional 
neural networks. 


2 Background 

Neural networks form a successful framework for classification that imitates the
activation of neurons. Finite neural networks are usually described by a layered
graph, see Figure 1. Its input layer consists of nodes that receive the input signal
$x \in \mathbb{R}^d$. Its subsequent layers consist of parameters that encode the classification
process: its intermediate layers consist of activation nodes. Each activation node relies on its
parameters $w$ to produce a response $f(\langle x, w \rangle)$. The function $f(t)$ is called an
activation function or a transfer function and it introduces non-linearities to the network.
Transfer functions imitate the neuron behavior, activating whenever the linear input
$\langle x, w \rangle$ is high enough. There are various forms of non-linear transfer functions,
e.g., step and rectified linear functions. Recently, the rectified linear function
$\mathrm{ReLU}(t) = \max(0, t)$ was successfully used in neural networks as it carries the neuron
signal better DU. Another popular transfer function is the step function
$\mathrm{step}(t) = \mathbb{1}[t > 0]$ that attains the value one if $t > 0$ and zero otherwise. The
output of a network with a single intermediate layer linearly weights the activations
$f(\langle x, w_1 \rangle), \dots, f(\langle x, w_k \rangle)$ with the output parameters
$u_1, \dots, u_k$. Its classification is determined by the sign of
$\sum_i u_i f(\langle x, w_i \rangle)$.

A classical result by Hornik asserts that networks with one intermediate layer
are universal approximators when the number of activation units $k$ tends to infinity
[9]. Consequently, neural networks have been studied in the infinite setting [16, 24]
m 0 Hu m. In this setting there are infinitely many transfer functions
$f(\langle w, x \rangle)$, each of them indexed by $w$. Summing over infinitely many transfer
functions is formalized by integrating over possible $w$. Formally, we replace
$\sum_i u_i f(\langle x, w_i \rangle)$ with $\int u(w) f(\langle x, w \rangle)\, d\mu(w)$. The
measure $\mu(w)$ may be any probability distribution over $\mathbb{R}^d$ as long as this
integral is finite, e.g., the Gaussian distribution
$d\mu(w) = (2\pi)^{-d/2} \exp(-\|w\|^2 / 2)\, dw$.
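The passage from a finite sum to an integral can be checked numerically. The following is a minimal sketch, assuming the Gaussian measure, the ReLU transfer function and the illustrative choice $u(w) = 1$; the helper name `finite_width_output` is hypothetical and not part of the paper. For this choice the integral equals $\|x\| / \sqrt{2\pi}$, so the average over $k$ randomly drawn units should approach that value as $k$ grows.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def finite_width_output(x, k, u=lambda w: np.ones(len(w)), seed=0):
    """Average of u(w_i) * ReLU(<x, w_i>) over k units with w_i ~ N(0, I).

    This is a Monte Carlo estimate of the integral of u(w) f(<x, w>) d mu(w).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((k, x.shape[0]))  # one row per hidden unit
    return float(np.mean(u(w) * relu(w @ x)))

x = np.array([3.0, 4.0])
exact = np.linalg.norm(x) / np.sqrt(2.0 * np.pi)  # closed form for u = 1
for k in (10, 1_000, 100_000):
    print(k, finite_width_output(x, k), exact)
```

The uniform weights $u_i = 1/k$ play the role of $u(w)\, d\mu(w)$ in the finite network; other choices of $u$ can be passed as a function of the sampled rows.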


When taking a discriminative approach, one learns a neural network that best
describes the training data $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, where $x_i$ is a data
instance (e.g., an image or a sentence) and $y_i$ is its semantic label. While learning an
infinite network with a single intermediate layer, one needs to consider compact ways to
represent the function $u(w)$. Kernel methods can be used for this task while representing the
classifier by its dual 0. Particularly, the network's output is an inner product between $u(w)$
and an input-dependent function $\phi_x(w) = f(\langle x, w \rangle)$:

$$\langle u, \phi_x \rangle_\mu = \int u(w)\, \phi_x(w)\, d\mu(w) \qquad (1)$$

Since $u(w)$ is trained over a finite set of feature functions it can be restricted without
loss of generality to the linear span of the training feature functions
$\phi_{x_1}(w), \dots, \phi_{x_m}(w)$, namely $u(w) = \sum_{i=1}^m \alpha_i \phi_{x_i}(w)$ for some
real valued numbers $\alpha_1, \dots, \alpha_m$. Therefore, when evaluating the output value of the
network $\langle u, \phi_{x_j} \rangle_\mu$ it suffices to compute the kernel entries

$$k(x_i, x_j) = \langle \phi_{x_i}, \phi_{x_j} \rangle_\mu = \int f(\langle x_i, w \rangle)\, f(\langle x_j, w \rangle)\, d\mu(w).$$



Various works have already computed the kernel function for different transfer
functions with respect to the Gaussian measure, including the rectified linear and the step
function [24, 6]. In all these cases, the kernel has an analytic form although the
features $\phi_x(w)$ are not finite vectors but functions over $\mathbb{R}^d$. Explicitly, let
$\rho_{ij} = \langle x_i, x_j \rangle / (\|x_i\| \|x_j\|)$; then, up to the normalization
constant $1/(2\pi)$,

$$k_{\mathrm{ReLU}}(x_i, x_j) = \|x_i\| \|x_j\| \left( \sin(\arccos(\rho_{ij})) + (\pi - \arccos(\rho_{ij}))\, \rho_{ij} \right),$$
$$k_{\mathrm{step}}(x_i, x_j) = \pi - \arccos(\rho_{ij}).$$

Whenever the measure is not Gaussian there is no analytic solution for the different
kernels. Nevertheless, whenever the probability density function $d\mu(w)$ is log-concave
(i.e., $\log(d\mu(w))$ is a concave function) then $\langle x, w \rangle$ has a log-concave
distribution, thus the probability that a sampled estimate of the kernel deviates from its
expectation decays exponentially with the number of samples.
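The analytic forms above can be compared against direct sampling from the Gaussian measure. The sketch below is illustrative only: the function names are hypothetical, the closed forms include the $1/(2\pi)$ normalization so that they equal the expectation under $w \sim N(0, I)$, and the Monte Carlo routine is just one simple way to estimate the kernel integral.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def step(t):
    return (t > 0).astype(float)

def correlation(xi, xj):
    r = xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))
    return float(np.clip(r, -1.0, 1.0))  # guard arccos against round-off

def k_relu(xi, xj):
    """E_w[ReLU(<w, xi>) ReLU(<w, xj>)] for w ~ N(0, I)."""
    r = correlation(xi, xj)
    theta = np.arccos(r)
    scale = np.linalg.norm(xi) * np.linalg.norm(xj) / (2.0 * np.pi)
    return float(scale * (np.sin(theta) + (np.pi - theta) * r))

def k_step(xi, xj):
    """E_w[step(<w, xi>) step(<w, xj>)] for w ~ N(0, I)."""
    return float((np.pi - np.arccos(correlation(xi, xj))) / (2.0 * np.pi))

def monte_carlo_kernel(f, xi, xj, n_samples=200_000, seed=0):
    """Sampled estimate of the kernel integral for transfer function f."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_samples, xi.shape[0]))
    return float(np.mean(f(w @ xi) * f(w @ xj)))

xi = np.array([1.0, 0.0, 0.0])
xj = np.array([0.6, 0.8, 0.0])
print(k_relu(xi, xj), monte_carlo_kernel(relu, xi, xj))
print(k_step(xi, xj), monte_carlo_kernel(step, xi, xj))
```

The two printed pairs should agree to within sampling error; the agreement tightens as `n_samples` grows.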

3 Stochastic kernels for deep and infinitely wide neural 
networks 

A deep learning architecture considers multiple intermediate layers. Deep architectures
have proven successful as they express non-linearities easily. Unfortunately,
when considering multiple infinite layers it is difficult to represent the network
parameters. Such difficulties do not appear when considering finite layers since all
parameters in all layers are vectors. However, when considering infinite layers, the
parameters are functions (in the second intermediate layer), functions of functions (in the
subsequent layer) and so on, see Figure 1. In the following we present the framework of
learning with multiple intermediate layers. For clarity of presentation we describe
two intermediate layers. We refer to these networks as deep networks to differentiate
them from the known networks with a single infinite layer.

The main challenge in working with deep infinite networks is to establish the space
in which the deep neurons exist. The neurons of the second intermediate layer take
as input the functions $\phi_x(w)$, which are the output of the first intermediate layer, i.e.,
$\phi_x(w)$ for any $w \in \mathbb{R}^d$. Therefore, each neuron in the second layer is a function
$u : \mathbb{R}^d \to \mathbb{R}$ that weights its input values (which are $\phi_x(w)$ for any
$w \in \mathbb{R}^d$) in a linear manner $\langle u, \phi_x \rangle_\mu$. The output of each such
neuron is the activation of the transfer function $\psi_x(u) = f(\langle \phi_x, u \rangle_\mu)$.
Therefore, the output layer of a deep infinite network needs to take all its inputs, i.e.,
$\psi_x(u)$ for any function $u(\cdot)$, and weight their activations by $v(u)$. Thus the output
layer computes the linear function $\langle v, \psi_x \rangle_\nu$ with respect to a measure
$\nu$. Next, we determine the measure space of $v(u)$ in terms of stochastic processes.

It is natural to consider the activation of neurons in the second intermediate layer
$\langle \phi_x, u \rangle_\mu$ with respect to the measure $\mu(w)$ using probabilistic terms.
This linear function is the covariance of two random variables,
$\langle \phi_x, u \rangle_\mu = E_{w \sim \mu}[\phi_x(w) u(w)]$. Similarly, the activation of the
output neuron is $\langle v, \psi_x \rangle_\nu = E_{u \sim \nu}[\psi_x(u) v(u)]$. With this
perspective, the functions $u : \mathbb{R}^d \to \mathbb{R}$ are chosen randomly according to the
measure $\nu$. Equivalently, $\nu$ is a stochastic process. In our work we restrict ourselves
to a Gaussian process, a stochastic process for which any finite collection of random
variables $u(w_1), \dots, u(w_k)$ has a multivariate Gaussian distribution. A Gaussian
process is completely determined by its first and second order statistics. The
mean function of a Gaussian process is $E_{u \sim \nu}[u(w)]$. Its covariance function is
$C(w_1, w_2) = E_{u \sim \nu}[u(w_1) u(w_2)]$. We consider Gaussian processes with zero mean
function and a general covariance function, thus we denote $\nu = GP(C)$.

To learn a deep infinite network that linearly separates the training examples it
suffices to use the stochastic kernel function:

$$k_f^{(2)}(x_i, x_j) = \langle \psi_{x_i}, \psi_{x_j} \rangle_\nu = E_{u \sim GP(C)}\left[ f(\langle \phi_{x_i}, u \rangle_\mu)\, f(\langle \phi_{x_j}, u \rangle_\mu) \right] \qquad (2)$$

Recall that the first layer responses are the transfer functions
$\phi_{x_i}(w) = f(\langle w, x_i \rangle)$ and $\phi_{x_j}(w) = f(\langle w, x_j \rangle)$. Thus a
stochastic kernel for deep infinite networks averages non-linearities while considering their
covariances.

Although the Gaussian process has infinitely many random variables, its unique
properties allow us to compute the stochastic kernel function analytically.

Theorem 1.

$$k_f^{(2)}(x_i, x_j) = E_{(z_1, z_2) \sim N(0, \Sigma)}[f(z_1) f(z_2)],$$

where $z = (z_1, z_2)$ is a bivariate Gaussian random variable with zero mean and covariance
matrix $\Sigma$:

$$\Sigma = E_{w_1, w_2} \begin{pmatrix}
f(\langle w_1, x_i \rangle)\, C(w_1, w_2)\, f(\langle w_2, x_i \rangle) & f(\langle w_1, x_i \rangle)\, C(w_1, w_2)\, f(\langle w_2, x_j \rangle) \\
f(\langle w_1, x_i \rangle)\, C(w_1, w_2)\, f(\langle w_2, x_j \rangle) & f(\langle w_1, x_j \rangle)\, C(w_1, w_2)\, f(\langle w_2, x_j \rangle)
\end{pmatrix} \qquad (3)$$

and $w_1, w_2$ are chosen independently from a $d$-dimensional multivariate Gaussian
with zero mean and unit covariance, i.e., $N(0, I)$.

Proof. $z_1 = \langle \phi_{x_i}, u \rangle_\mu$ is a Gaussian random variable with zero mean
(see footnote 1). Similarly, $z_2 = \langle \phi_{x_j}, u \rangle_\mu$ is a Gaussian random
variable and both $z_1, z_2$ are jointly Gaussian. Thus $z = (z_1, z_2)$ is a bivariate Gaussian
random variable with zero mean and some covariance matrix $\Sigma$. The expected value with
respect to the Gaussian process reduces to

$$E_{u \sim GP(C)}\left[ f(\langle \phi_{x_i}, u \rangle_\mu)\, f(\langle \phi_{x_j}, u \rangle_\mu) \right] = E_{(z_1, z_2) \sim N(0, \Sigma)}[f(z_1) f(z_2)].$$

The covariance matrix $\Sigma$ is a $2 \times 2$ matrix whose $(r, s)$ entry is
$E_{z \sim N(0, \Sigma)}[z_r z_s]$. Recall that $z_1 = E_w[\phi_{x_i}(w) u(w)]$ and that
$\Sigma_{11} = E[z_1^2]$, then

$$\Sigma_{11} = E_u\left[ E_{w_1}[\phi_{x_i}(w_1) u(w_1)] \cdot E_{w_2}[\phi_{x_i}(w_2) u(w_2)] \right]$$
$$= E_{w_1, w_2}\left[ \phi_{x_i}(w_1)\, \phi_{x_i}(w_2)\, E_u[u(w_1) u(w_2)] \right]$$
$$= E_{w_1, w_2}\left[ \phi_{x_i}(w_1)\, \phi_{x_i}(w_2)\, C(w_1, w_2) \right].$$

We used Fubini's theorem to change the order of integration. The values of $\Sigma_{r,s}$ then
follow in the same manner, while recalling that $\phi_{x_i}(w_1) = f(\langle x_i, w_1 \rangle)$ and
$\phi_{x_j}(w_2) = f(\langle x_j, w_2 \rangle)$. □

1 This is a classical result and can be shown by working with the Riemann-Stieltjes integral, decomposing
it to finite sums. Since any finite instantiation of a Gaussian process is a multivariate Gaussian random
variable with zero mean, the Riemann-Stieltjes sum is also a Gaussian random variable, thus the limit (using
the characteristic function) is also Gaussian.
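Theorem 1 suggests a two-stage computation: first the $2 \times 2$ matrix of Equation (3), then a bivariate Gaussian expectation. The sketch below is one possible illustration, not the authors' implementation. It assumes the ReLU transfer function, estimates Equation (3) by plain Monte Carlo over $w_1, w_2 \sim N(0, I)$, and uses an unnormalized squared exponential covariance as an example; all function names and the parameter `a` are hypothetical.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def squared_exponential(w1, w2, a=0.5):
    """Assumed covariance C(w1, w2) = exp(-a * ||w1 - w2||^2 / 2), row-wise."""
    return np.exp(-0.5 * a * np.sum((w1 - w2) ** 2, axis=1))

def sigma_monte_carlo(xi, xj, f=relu, cov=squared_exponential, n=100_000, seed=0):
    """Estimate the 2x2 matrix of Equation (3) by sampling w1, w2 ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    d = xi.shape[0]
    w1 = rng.standard_normal((n, d))
    w2 = rng.standard_normal((n, d))
    c = cov(w1, w2)
    f1 = np.stack([f(w1 @ xi), f(w1 @ xj)], axis=1)  # first-layer responses to w1
    f2 = np.stack([f(w2 @ xi), f(w2 @ xj)], axis=1)  # first-layer responses to w2
    sigma = (f1 * c[:, None]).T @ f2 / n             # entry (r, s): mean of f1_r * C * f2_s
    return 0.5 * (sigma + sigma.T)                   # symmetrize the sampled estimate

def relu_gaussian_expectation(sigma):
    """E[ReLU(z1) ReLU(z2)] for (z1, z2) ~ N(0, sigma), in closed form."""
    s1, s2 = np.sqrt(sigma[0, 0]), np.sqrt(sigma[1, 1])
    rho = np.clip(sigma[0, 1] / (s1 * s2), -1.0, 1.0)
    theta = np.arccos(rho)
    return float(s1 * s2 * (np.sin(theta) + (np.pi - theta) * rho) / (2.0 * np.pi))

xi = np.array([1.0, 0.0, 0.0])
xj = np.array([0.6, 0.8, 0.0])
sigma = sigma_monte_carlo(xi, xj)
print(sigma)
print(relu_gaussian_expectation(sigma))  # sampled estimate of the stochastic kernel
```

Sampling is used here only because it works for any covariance function; when the covariance has more structure the matrix can be obtained without sampling.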

An important family of Gaussian processes is described by shift-invariant covariance
functions, namely $C(w_1, w_2) = c(w_1 - w_2)$. Bochner's theorem represents such
covariance functions as $E_{w,b}[g(\langle w_1, w \rangle + b)\, g(\langle w_2, w \rangle + b)]$,
where $w$ is drawn from a distribution $p$ over $\mathbb{R}^d$, $b$ is drawn from the uniform
distribution over $[0, 2\pi]$ and $g(t) = \sqrt{2} \cos(t)$ [17]. Whenever $p$ is known we are
able to efficiently compute a stochastic kernel for deep infinite networks and shift-invariant
covariance functions:

Corollary 1. Let $C(w_1, w_2) = c(w_1 - w_2)$ be a shift-invariant covariance function
and let $p$ be its corresponding measure derived by Bochner's theorem. Consider the
$6 \times 6$ covariance matrix

$$\Sigma_z = \begin{pmatrix}
\|x_i\|^2 & \langle x_i, w \rangle & \langle x_i, x_j \rangle \\
\langle x_i, w \rangle & \|w\|^2 & \langle w, x_j \rangle \\
\langle x_i, x_j \rangle & \langle x_j, w \rangle & \|x_j\|^2
\end{pmatrix} \otimes I_{2 \times 2},$$

where $A \otimes B$ is the tensor product of two matrices. Let $g(t) = \sqrt{2} \cos(t)$ and
assume that $b$ is drawn from the uniform distribution over $[0, 2\pi]$, $w$ is drawn according
to $p$ and $z \sim N(0, \Sigma_z)$ is a multivariate Gaussian. Then the covariance matrix $\Sigma$
of the stochastic kernel for deep infinite neural networks
$k_f^{(2)}(x_i, x_j) = E_{(z_1, z_2) \sim N(0, \Sigma)}[f(z_1) f(z_2)]$ is

$$\Sigma = E_{w, b, z} \begin{pmatrix}
f(z_1) f(z_2)\, g(z_3 + b)\, g(z_4 + b) & f(z_1) f(z_6)\, g(z_3 + b)\, g(z_4 + b) \\
f(z_1) f(z_6)\, g(z_3 + b)\, g(z_4 + b) & f(z_5) f(z_6)\, g(z_3 + b)\, g(z_4 + b)
\end{pmatrix}.$$

Proof. The entries of the covariance matrix of the stochastic kernel are derived in
Equation (3) using $\Sigma_{r,s} = h_{r,s}(z_1, \dots, z_6)$ where
$z_1 = \langle w_1, x_i \rangle$, $z_2 = \langle w_2, x_i \rangle$, $z_3 = \langle w_1, w \rangle$,
$z_4 = \langle w_2, w \rangle$, $z_5 = \langle w_1, x_j \rangle$, $z_6 = \langle w_2, x_j \rangle$.
Since $w_1, w_2 \sim N(0, I)$ are independent, $z$ is a multivariate Gaussian with zero mean and
its distribution is fully determined by its covariance matrix $\Sigma_z$. The corollary then
follows by direct computation of the covariance matrix, e.g.,
$(\Sigma_z)_{1,3} = E_z[z_1 z_3] = \sum_{r,s} E_{w_1}[w_{1,r}\, x_{i,r}\, w_{1,s}\, w_s] = \sum_{r,s} x_{i,r}\, w_s \cdot E_{w_1}[w_{1,r} w_{1,s}]$
and $E_{w_1}[w_{1,r} w_{1,s}] = \mathbb{1}[r = s]$ is the indicator function
that equals one if $r = s$ and zero otherwise. □
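Bochner's representation can be verified numerically for covariance functions whose measure $p$ is known. The sketch below is an illustration under stated assumptions: it uses the standard Gaussian for the squared exponential covariance $\exp(-\|w_1 - w_2\|^2 / 2)$ and independent standard Cauchy coordinates for $\exp(-\|w_1 - w_2\|_1)$; the function and variable names are hypothetical.

```python
import numpy as np

def bochner_estimate(w1, w2, sample_w, n=200_000, seed=0):
    """Monte Carlo estimate of E_{w,b}[g(<w1, w> + b) g(<w2, w> + b)]."""
    rng = np.random.default_rng(seed)
    w = sample_w(rng, n, w1.shape[0])          # rows drawn from the measure p
    b = rng.uniform(0.0, 2.0 * np.pi, size=n)  # random phase
    g = lambda t: np.sqrt(2.0) * np.cos(t)
    return float(np.mean(g(w @ w1 + b) * g(w @ w2 + b)))

# Measures p paired with the covariance functions they represent.
gaussian = lambda rng, n, d: rng.standard_normal((n, d))
cauchy = lambda rng, n, d: rng.standard_cauchy((n, d))

w1 = np.array([0.5, 0.0])
w2 = np.array([0.0, 0.5])
delta = w1 - w2
print(bochner_estimate(w1, w2, gaussian), np.exp(-0.5 * delta @ delta))
print(bochner_estimate(w1, w2, cauchy), np.exp(-np.sum(np.abs(delta))))
```

Each printed pair compares the sampled value with the covariance function it should reproduce.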

The ability to realize the measure $p$ that is suggested by Bochner's theorem
determines the validity of this approach. Bochner's theorem relates a shift-invariant
function $c(w_1 - w_2)$ to a Fourier transform, thus $p$ can be recovered by its inverse transform.
There are some special covariance functions for which this measure is known. For
example, the covariance function $C(w_1, w_2) = \exp(-\|w_1 - w_2\|_1)$ that relates to the
Ornstein-Uhlenbeck Gaussian process can be computed using the Cauchy distribution
$dp(w) = \prod_{r=1}^{d} \left( \pi (1 + w_r^2) \right)^{-1} dw$. Whenever the covariance function
defines a squared exponential Gaussian process, $C(w_1, w_2) \propto \exp(-a \|w_1 - w_2\|^2 / 2)$,
the stochastic kernel for deep neural networks can be computed analytically. This follows from the
observation that the Gaussian process couples the independent $d$-dimensional Gaussian
random variables $w_1, w_2$ to a $2d$-dimensional Gaussian variable $w = (w_1, w_2)$
with correlation $a$:
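
Corollary 2. Consider the covariance function
$C(w_1, w_2) = (1 + 2a)^{1 + d/2} \exp(-a \|w_1 - w_2\|^2 / 2)$. Consider the $4 \times 4$
covariance matrix

$$\Sigma_z = \begin{pmatrix} \|x_i\|^2 & \langle x_i, x_j \rangle \\ \langle x_i, x_j \rangle & \|x_j\|^2 \end{pmatrix} \otimes \begin{pmatrix} 1 + a & a \\ a & 1 + a \end{pmatrix}.$$

Then the covariance matrix $\Sigma$ of the stochastic kernel for deep neural networks
$k_f^{(2)}(x_i, x_j) = E_{(z_1, z_2) \sim N(0, \Sigma)}[f(z_1) f(z_2)]$ is

$$\Sigma = E_{z \sim N(0, \Sigma_z)} \begin{pmatrix} f(z_1) f(z_2) & f(z_1) f(z_4) \\ f(z_1) f(z_4) & f(z_3) f(z_4) \end{pmatrix}.$$

Moreover, $k_{\mathrm{ReLU}}^{(2)}(x_i, x_j)$ and $k_{\mathrm{step}}^{(2)}(x_i, x_j)$ have analytic forms.

Proof. Considering Equation (3) we note that
$g_I(w_1)\, g_I(w_2)\, C(w_1, w_2) = (1 + 2a)\, g_{\Sigma_w}(w)$, where $g_I(w_1)$ is the
$d$-dimensional Gaussian density function $N(0, I)$ and $g_{\Sigma_w}(w)$ is the $2d$-dimensional
Gaussian density function $N(0, \Sigma_w)$. We denote by $A \otimes B$ the tensor product of two
matrices, thus

$$(1 + 2a)\, \Sigma_w = \begin{pmatrix} 1 + a & a \\ a & 1 + a \end{pmatrix} \otimes I_{d \times d}.$$

The form of $\Sigma_z$ is attained when setting $z_1 = \langle w_1, x_i \rangle$,
$z_2 = \langle w_2, x_i \rangle$, $z_3 = \langle w_1, x_j \rangle$, $z_4 = \langle w_2, x_j \rangle$.
With this notation, the form of $\Sigma$ is a direct consequence of Equation (3).

To compute the entries of $\Sigma$ when $f(t) = \mathrm{ReLU}(t)$ we recall that whenever
$(z_1', z_2') \sim N(0, \Sigma')$ with $\Sigma'_{11} = \sigma_1^2$,
$\Sigma'_{12} = \Sigma'_{21} = \rho \sigma_1 \sigma_2$, $\Sigma'_{22} = \sigma_2^2$ then
$E_{z'}[f(z_1') f(z_2')] = h(\sigma_1, \sigma_2, \rho)$ and, up to the normalization constant
$1/(2\pi)$, $h(\sigma_1, \sigma_2, \rho) = \sigma_1 \sigma_2 \left( \sin(\arccos(\rho)) + \rho (\pi - \arccos(\rho)) \right)$.
Then

$$\Sigma_{\mathrm{ReLU}} = \begin{pmatrix}
h\left( \sqrt{1+a}\, \|x_i\|, \sqrt{1+a}\, \|x_i\|, \frac{a}{1+a} \right) & h\left( \sqrt{1+a}\, \|x_i\|, \sqrt{1+a}\, \|x_j\|, \frac{a \langle x_i, x_j \rangle}{(1+a) \|x_i\| \|x_j\|} \right) \\
h\left( \sqrt{1+a}\, \|x_i\|, \sqrt{1+a}\, \|x_j\|, \frac{a \langle x_i, x_j \rangle}{(1+a) \|x_i\| \|x_j\|} \right) & h\left( \sqrt{1+a}\, \|x_j\|, \sqrt{1+a}\, \|x_j\|, \frac{a}{1+a} \right)
\end{pmatrix}.$$

Thus, $k_{\mathrm{ReLU}}^{(2)}(x_i, x_j) = E_{(z_1, z_2) \sim N(0, \Sigma_{\mathrm{ReLU}})}[f(z_1) f(z_2)]$
is a recursive application of $h(\cdot)$ with the appropriate parameters.

To compute the entries of $\Sigma$ when $f(t) = \mathbb{1}[t > 0]$ we recall that whenever
$(z_1', z_2') \sim N(0, \Sigma')$ with $\Sigma'_{11} = \sigma_1^2$,
$\Sigma'_{12} = \Sigma'_{21} = \rho \sigma_1 \sigma_2$, $\Sigma'_{22} = \sigma_2^2$ then, again up
to the normalization constant $1/(2\pi)$, $E_{z'}[f(z_1') f(z_2')] = h(\rho) = \pi - \arccos(\rho)$.
Then

$$\Sigma_{\mathrm{step}} = \begin{pmatrix}
h\left( \frac{a}{1+a} \right) & h\left( \frac{a \langle x_i, x_j \rangle}{(1+a) \|x_i\| \|x_j\|} \right) \\
h\left( \frac{a \langle x_i, x_j \rangle}{(1+a) \|x_i\| \|x_j\|} \right) & h\left( \frac{a}{1+a} \right)
\end{pmatrix}.$$

As before, $k_{\mathrm{step}}^{(2)}(x_i, x_j) = E_{(z_1, z_2) \sim N(0, \Sigma_{\mathrm{step}})}[f(z_1) f(z_2)]$
is a recursive application of $h(\cdot)$ with the appropriate parameters. □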
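The recursive use of $h(\cdot)$ for the ReLU case can be written out directly. The sketch below is an illustration rather than the authors' code: function names are hypothetical, $h$ includes the $1/(2\pi)$ normalization so that it equals the Gaussian expectation, and a Monte Carlo routine over $w_1, w_2 \sim N(0, I)$ with the covariance function of Corollary 2 is included only as a sanity check of the first diagonal entry.

```python
import numpy as np

def h_relu(s1, s2, rho):
    """E[ReLU(z1) ReLU(z2)] for standard deviations s1, s2 and correlation rho."""
    rho = float(np.clip(rho, -1.0, 1.0))
    theta = np.arccos(rho)
    return float(s1 * s2 * (np.sin(theta) + rho * (np.pi - theta)) / (2.0 * np.pi))

def sigma_relu(xi, xj, a):
    """2x2 covariance of the stochastic kernel for ReLU, following Corollary 2."""
    ni, nj = np.linalg.norm(xi), np.linalg.norm(xj)
    si, sj = np.sqrt(1.0 + a) * ni, np.sqrt(1.0 + a) * nj
    r_same = a / (1.0 + a)
    r_cross = a * (xi @ xj) / ((1.0 + a) * ni * nj)
    return np.array([[h_relu(si, si, r_same), h_relu(si, sj, r_cross)],
                     [h_relu(si, sj, r_cross), h_relu(sj, sj, r_same)]])

def k2_relu(xi, xj, a):
    """Second application of h: E[ReLU(z1) ReLU(z2)] with z ~ N(0, sigma_relu)."""
    s = sigma_relu(xi, xj, a)
    s1, s2 = np.sqrt(s[0, 0]), np.sqrt(s[1, 1])
    return h_relu(s1, s2, s[0, 1] / (s1 * s2))

def sigma11_monte_carlo(x, a, n=400_000, seed=0):
    """Sampled E[ReLU(<w1, x>) C(w1, w2) ReLU(<w2, x>)] with w1, w2 ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    w1 = rng.standard_normal((n, d))
    w2 = rng.standard_normal((n, d))
    c = (1.0 + 2.0 * a) ** (1.0 + d / 2.0) * np.exp(-0.5 * a * np.sum((w1 - w2) ** 2, axis=1))
    return float(np.mean(np.maximum(0.0, w1 @ x) * c * np.maximum(0.0, w2 @ x)))

xi = np.array([1.0, 0.0, 0.0])
xj = np.array([0.6, 0.8, 0.0])
print(sigma_relu(xi, xj, a=0.5)[0, 0], sigma11_monte_carlo(xi, a=0.5))
print(k2_relu(xi, xj, a=0.5))
```

The first printed pair compares the closed-form entry with its sampled counterpart; the last line is the resulting two-layer kernel value.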


Deep neural networks are usually applied to multiclass problems, where there are
more than two labels to classify. Thus the label space resides in a discrete set
$y \in \{1, \dots, K\}$. For notational convenience we focused above on binary classification,
where $y \in \{-1, 1\}$ is determined by the sign of the output layer $\langle v, \psi_x \rangle$.
In the multiclass setting, a data instance $x$ can belong to any of the $K$ classes. A standard
extension of the above setting to multiclass learning is to introduce $K$ decision boundaries
$v_1(u), \dots, v_K(u)$. Multiclass prediction is performed by choosing the decision which is most
certain, i.e., $\arg\max_i \langle v_i, \psi_x \rangle$. In the next section we describe the
generalization properties of deep infinite neural networks in the multiclass setting.
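The multiclass decision rule reduces to a few matrix operations once each decision function is expressed through kernel values. The sketch below assumes that each $v_i$ is a finite combination $\sum_j \alpha_{i,j} \psi_{x_j}$ of training feature functions, so that $\langle v_i, \psi_x \rangle = \sum_j \alpha_{i,j}\, k(x_j, x)$; the name `predict` and the zero-based class labels are illustrative choices.

```python
import numpy as np

def predict(alpha, gram_train_test):
    """Return the most certain class for each test instance.

    alpha:            (K, m) coefficients, one row per class.
    gram_train_test:  (m, n) kernel values k(x_j, x_t) between the m training
                      instances and the n test instances.
    """
    scores = alpha @ gram_train_test  # scores[i, t] = <v_i, psi_{x_t}>
    return np.argmax(scores, axis=0)  # labels in {0, ..., K-1}

alpha = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
gram = np.array([[2.0, 0.1],
                 [0.5, 3.0]])
print(predict(alpha, gram))
```

Any positive semidefinite kernel can supply the Gram matrix, including the single-layer and the stochastic kernels discussed above.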

4 Deep infinite networks, generalization and experimental
validation

Practice shows that neural networks often do not overfit, even when the number
of learned parameters is orders of magnitude larger than the number of training
examples. In the following we address this scenario while suggesting some insight into why
infinite networks generalize well. We show that generalization is mostly dependent on
the expressive power of the output layer, which is regularized by its capacity. Consider
a multiclass deep infinite network $v_1(u), \dots, v_K(u)$ that classifies the functions
$\psi_{x_i}(u)$ according to the most certain linear response function
$y_v(x) = \arg\max_i \langle v_i, \psi_x \rangle$. Since each decision function $v_i(u)$ interacts
linearly with the training data, it must be a finite sum of these functions, i.e.,
$v_i(u) = \sum_{j=1}^{m} \alpha_{i,j} \psi_{x_j}(u)$. Therefore, as long as the functions
$\psi_{x_i}(u)$ are simple (e.g., truncated linear functions in the case of ReLU units), the
capacity of the deep infinite network is limited by the size of the training data. Whenever
there are stronger guarantees on the data, i.e., that the training data is separable with a
margin, they translate to a stronger regularization on $v_k(u)$ that is derived from the
passive-aggressive learner [8]. To be more precise, we say that the data is separable when
there are functions $v_1^*(u), \dots, v_K^*(u)$ that classify correctly any data instance.
Formally, for any data-label pair $(x, y)$ there holds $y = y_{v^*}(x)$. These data-label pairs
are separated with a margin if $y = y_{v^*}(x)$ and
$\langle v_y^*, \psi_x \rangle \ge 1 + \max_{i \ne y} \langle v_i^*, \psi_x \rangle$. In this
setting, the kernel version of the passive-aggressive algorithm ensures that
$v_i(u) = \sum_{j=1}^{t} \alpha_{i,j} \psi_{x_j}(u)$, where $t \le R^2 \sum_i \|v_i^*\|^2$
and $\|\psi_x\|^2 \le R^2$. Thus, whenever there is a separation with margin, the
passive-aggressive analysis ensures that the deep learner combines at most $t$ functions
regardless of the training size, so it has restricted capacity and thus a simple form.

Unfortunately, the separable setting rarely exists in practice. Nevertheless, deep
learners perform well in the non-separable setting. Usually deep learning schemes use
the logistic regression framework, which maximizes the conditional probability of the
training data $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$. The conditional probability follows the
Gibbs distribution $p_v(y|x) = \exp(\langle v_y, \psi_x \rangle) / Z(v)$ where
$Z(v) = \sum_i \exp(\langle v_i, \psi_x \rangle)$ is the partition function. Thus the parameters
of the network are learned by the optimization program:



[Equation (4)]


As this is an infinite convex program it is appealing to consider its dual. The dual
program is
$\min_\alpha \sum_{i,k} \alpha_{i,k} \log \alpha_{i,k} + \sum_{i=1}^{K} \|v_i(\alpha)\|^2 / 2\lambda_m$,
where $v(\alpha)$ is the separator induced by the dual variables, a linear combination of the
functions $\psi_{x_k}$ with coefficients determined by $\alpha_{i,k}$, and
$\sum_k \alpha_{i,k} = 1$. The dual program is smooth and strongly convex, and
therefore enjoys rapid convergence, i.e., with $O(\log(1/\epsilon))$ updates to the elements
$\alpha$ a dual exponentiated coordinate descent achieves an $\epsilon$-optimal dual solution El.
Although this algorithm achieves a good primal solution in practice, it does not guarantee
that $v(\alpha)$ is an $\epsilon$-optimal primal solution as well. Recently, many efficient
algorithms were devised to achieve both dual and primal guarantees with $O(\log(1/\epsilon))$ steps
(cf. HMD). These algorithms aggregate data points $\psi_{x_i}$ to their separators $v(u)$,
therefore after a small number of steps a good, yet simple separator is reached. Said differently,
although different separators may exist around $v^S(u)$ the algorithm outputs a fairly
simple separator as it is regularized by an early stopping criterion.
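The Gibbs distribution and a regularized log-loss can be evaluated in kernel form. The sketch below is a hedged illustration: it assumes each $v_k$ is a combination $\sum_j \alpha_{k,j} \psi_{x_j}$ of training feature functions, and it assumes an objective made of the mean negative log-likelihood plus $(\lambda/2) \sum_k \|v_k\|^2$, which is one common form of regularized logistic regression and may differ in constants from the program used in the text. Function names are hypothetical.

```python
import numpy as np

def gibbs_probabilities(scores):
    """Column-wise p_v(k | x) from scores[k, t] = <v_k, psi_{x_t}>."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def regularized_log_loss(alpha, gram, y, lam):
    """Assumed objective: mean negative log-likelihood + (lam / 2) * sum_k ||v_k||^2.

    alpha: (K, m) coefficients, gram: (m, m) training Gram matrix,
    y: (m,) integer labels in {0, ..., K-1}.
    """
    scores = alpha @ gram
    p = gibbs_probabilities(scores)
    nll = -np.mean(np.log(p[y, np.arange(len(y))]))
    reg = 0.5 * lam * np.sum(scores * alpha)  # sum_k alpha_k^T G alpha_k
    return float(nll + reg)

gram = np.eye(4)
y = np.array([0, 1, 2, 0])
alpha = np.zeros((3, 4))
print(regularized_log_loss(alpha, gram, y, lam=0.1), np.log(3.0))
```

With all coefficients at zero every class is equally likely, so the loss equals $\log K$; this gives a convenient starting value when monitoring an optimizer.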

Considering the learning problem in Equation (4) as a loss minimization task, it
measures the average log-loss given training data. By the above, the empirical risk
minimizer $v^S$ is simple, i.e., it consists of $O(\log(1/\epsilon))$ functions $\psi_x(u)$. We
turn to show that this simple empirical risk minimizer also generalizes well: it achieves a
similar log-loss even when the data-label pairs are sampled from their true distribution
in the world.

Theorem 2. Assume that $\|\psi_x\| \le 1$ and that the training examples are sampled
independently from the data-label generating distribution $(x, y) \sim D$. Denote the log-risk by
$L_D(v) = E_{(x,y) \sim D} \log p_v(y|x)$ and the empirical risk by
$L_S(v) = \frac{1}{m} \sum_{i=1}^{m} \log p_v(y_i|x_i)$. Consider $v^S$ as defined in
Equation (4); then $|L_D(v^S) - L_S(v^S)| \le 1 / (m \lambda_m)$.

Proof. Generalization by stability for convex and Lipschitz loss functions with a strongly
convex regularizer was established in [3, 15, 22]. Although the technical details are
obscured in some of these results, we rely on their derivations (specifically [3, Theorem
22] and [22, Theorem 2]). The benefit of working with stability is that its basic concepts,
convexity and Lipschitz continuity, readily generalize to infinite spaces. To apply
generalization via stability to multiclass logistic regression we note that
$-\log p_v(y|x)$ is convex. Also, it is 1-Lipschitz since its gradient is uniformly bounded by 1
whenever $\|\psi_x\| \le 1$. □

The regularization ratio $\lambda_m$ is chosen such that $1/(m \lambda_m)$ goes to zero as $m$
tends to infinity. The important conclusion of the above theorem is that infinite models do
not necessarily overfit, as long as the infinite model interacts in a constrained manner
with the data. In our case the infinite model is constrained by convexity and Lipschitz
continuity. These two properties stabilize the learning procedure while ensuring that
small changes in $v$ do not change the prediction by much.

Next, we turn to experimentally validate our framework. The effectiveness of
infinite networks with a single infinite layer using the kernels $k_{\mathrm{ReLU}}(x_i, x_j)$ was
already demonstrated by mm. Thus in the following we show that our stochastic
kernel $k_{\mathrm{ReLU}}^{(2)}(x_i, x_j)$ with a squared exponential Gaussian process improves upon
$k_{\mathrm{ReLU}}(x_i, x_j)$. We run our kernels over the MNIST digit database. This dataset is the
standard entry point of neural networks and kernel methods.

Our stochastic kernel $k_{\mathrm{ReLU}}^{(2)}(x_i, x_j)$ was able to separate the training data
completely with only 50 iterations, while $k_{\mathrm{ReLU}}(x_i, x_j)$ that encodes a single
infinite layer did not (it nearly separated all examples). This supports the assertion that the
stochastic kernel is more expressive than the single layer kernel. As for test results, the average
error over all digits is 1.7% for the stochastic kernel and 1.9% for the single layer
kernel. Although the improvement is modest in absolute terms (only 0.2 percentage points), it
might be insightful to compare it to the errors of the single layer kernel, namely 0.2/1.9, which
is about a 10% relative gain. Lastly, since our kernel function is computed analytically, it is
trained as fast as any kernel method.
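The relative gain quoted above follows from the two reported test errors. The short sketch below reproduces that arithmetic; the variable names are illustrative and the only inputs are the 1.9% and 1.7% figures from the text.

```python
single_layer_error = 1.9  # percent, single layer kernel (reported in the text)
stochastic_error = 1.7    # percent, stochastic kernel (reported in the text)

absolute_gain = single_layer_error - stochastic_error  # percentage points
relative_gain = absolute_gain / single_layer_error     # fraction of the baseline error removed

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {100 * relative_gain:.1f}%")
```

The relative figure comes out slightly above 10%, in line with the "about 10%" stated in the text.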


5 Non-linearities 

The infinite layers, presented in Section 2 and Section 3, are limited in their expressive
power. In the first intermediate layer, the inner product $\langle x, w \rangle$ is performed for
any $w \in \mathbb{R}^d$, while each parameter $w$ acts globally on all input entries
$x \in \mathbb{R}^d$ linearly. Similarly, in the second layer, the function $u(w)$ acts linearly and
globally on every $\phi_x(w) = f(\langle w, x \rangle)$. These interactions ignore spatial
information in the vectors $x$ or the feature function $\phi_x(w)$, spatial information that is
important in computer vision and language processing applications. Current deep learning
architectures exploit spatial information using convolutions. These convolutions are applied to
patches in an image, or equivalently to overlapping subsets of the data instance $x$, and
recursively to their functions. These operations introduce important aspects of non-linearity and
locality. Our approach can be extended to deal with such operations, and thus to increase its
expressiveness to various non-linearities.

To describe a convolution-based operation in the first intermediate layer, we
transform the data instance $x \in \mathbb{R}^d$ to subsets of its elements
$x^{(1)}, \dots, x^{(P)} \in \mathbb{R}^{d_1}$, where $x^{(p)} \subset x$. For each such subset we
learn infinitely many responses $w \in \mathbb{R}^{d_1}$, while each response outputs
$\phi_{x,p}(w) = f(\langle w, x^{(p)} \rangle)$. Thus, $\phi_x = (\phi_{x,1}, \dots, \phi_{x,P})$ is
a $P$-dimensional function, $\phi_x : \mathbb{R}^{d_1} \to \mathbb{R}^P$. Note that in Section 2
the feature function $\phi_x$ mapped $\mathbb{R}^d$ to $\mathbb{R}$.
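The patch construction can be made concrete for a one-dimensional input. The sketch below is illustrative only: it assumes contiguous overlapping windows as the subsets $x^{(p)}$, a ReLU transfer function, and a finite sample of responses $w$ standing in for the infinite family; the function names are hypothetical, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def patches(x, d1, stride=1):
    """Overlapping length-d1 windows of x, one row per patch x^(p)."""
    return sliding_window_view(x, d1)[::stride]

def first_layer_responses(x, w, d1, stride=1, f=lambda t: np.maximum(0.0, t)):
    """Array of shape (P, n_w) with entry [p, r] = f(<w_r, x^(p)>)."""
    return f(patches(x, d1, stride) @ w.T)

x = np.arange(6.0)  # d = 6
w = np.eye(3)       # three sample responses in R^{d1}, d1 = 3
print(patches(x, 3).shape)  # P = 4 patches of length 3
print(first_layer_responses(x, w, 3))
```

Each column of the result samples one coordinate of the $P$-dimensional function $\phi_x$ at a particular $w$.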

Convolution-based operations in the second intermediate layer may also be applied.
The feature function $\phi_x$ is transformed to subsets of its elements
$\phi_x^{(1)}, \dots, \phi_x^{(Q)}$ where $\phi_x^{(q)} \subset \phi_x$, i.e.,
$\phi_x^{(q)} : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ is attained by restricting $\phi_x(w)$ to
some of its coordinates. Each of these subsets is weighted by
$u : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ and its resulting response is
$\psi_{x,q}(u) = f(\langle u, \phi_x^{(q)} \rangle)$, while
$\langle u, \phi_x^{(q)} \rangle = E_w[\langle u(w), \phi_x^{(q)}(w) \rangle]$ and the
latter inner product $\langle u(w), \phi_x^{(q)}(w) \rangle$ is between two vectors in
$\mathbb{R}^{d_2}$.

The above two constructions show how to integrate convolution-type non-linearities
in deep infinite networks. The appropriate kernels follow a straightforward derivation
of these higher dimension constructions.

6 Related work 

Neural networks, kernel methods and Gaussian processes have had a significant impact 
on the machine learning community and a full exposition of these methods can be 
found in machine learning textbooks on neural networks m , kernel methods ED and 
Gaussian processes ED- 

Neural networks have attracted considerable attention in the last few years. Their
practical success is unmatched in several machine learning applications (e.g., HD). In
recent years it has become possible to construct deep learning architectures whose number
of parameters is significantly larger than the number of training examples.
Surprisingly, these networks avoid overfitting. Several machine learning theories
were devised to explain how deep networks avoid overfitting based on dropout (e.g.,
MM). Our approach is different since we represent neural networks with a significant
number of parameters as infinite networks with multiple layers. We encode the
neurons' responses in functions, while each layer increases the complexity of its functions,
namely the first layer consists of functions over the Euclidean space and the second
layer consists of Gaussian processes. We avoid overfitting since our algorithm achieves
an almost optimal solution within a few steps, thus our resulting classifier is simple to
represent and regularized by early stopping. We provide a generalization bound for our
classifier based on stability [3, 15, 22].

Infinite neural networks were introduced by MM in the context of Bayesian
learning. They analyze the predictive probability of a neural network with an infinitely
wide intermediate layer. In particular, when the transfer function is bounded, this
predictive probability converges to a Gaussian process. When resolving the covariance
function of this process, [24] realized the kernel $k_{\mathrm{erf}}(x_i, x_j)$ along with other
kernel functions. This work differs from ours in a few respects. First, our work does not
consider the predictive probability of labels given data but rather we aim at maximizing
the likelihood of infinitely wide layers, a task that initially was supposed to overfit
and generalize poorly [23]. We establish the prediction power of our approach using
stability. Second, we build on multiple intermediate layers while trying to analyze the
success of deep learning architectures, as opposed to El. Lastly, our work
considers Gaussian processes differently than [24]. We use a Gaussian process to define a
measure over our second (i.e., deep) intermediate layer.

More recently, researchers explored different algorithms to learn infinite neural
networks 0. This work formulates learning an infinite network as an infinite convex
program and devises an incremental algorithm that is based on its dual representation.
[18] suggests optimizing an infinite network with a single layer using randomization
to decrease the computational complexity of the learning algorithm. Our work
addresses other properties of learning infinite networks, mainly using Gaussian processes for
constructing multiple layers and analyzing how infinite networks avoid overfitting.

Kernel methods for infinite neural networks are further explored in [5, 6]. These
works introduce the kernels $k_{\mathrm{step}}(x_i, x_j)$ and $k_{\mathrm{ReLU}}(x_i, x_j)$
along with other kernels, thus augmenting the works of [16, 24]. Moreover, they
introduce a kernel composition approach to simulate deep architectures. Our work
differs in the way we address and analyze deep architectures of infinite networks. We
construct deep layers that take their previous layer as input using Gaussian processes.
Connections between kernel methods and Gaussian processes were left as an open
problem in [6]. We also introduce a way to incorporate non-linearities and invariances,
such as those of convolutional neural networks, into our framework, another open
problem raised by [?]. In addition, we analyze why our networks avoid overfitting.
[4] demonstrate the effectiveness of these kernels in language processing.
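
To make the step and ReLU kernels and their composition concrete, the sketch below
uses the arc-cosine forms of [5]. The normalization may differ from the kernels
defined earlier in this paper, and the function names, the input dimension and the
Monte Carlo check are illustrative assumptions rather than part of the original work.

```python
import numpy as np

def _angle(kxy, kxx, kyy):
    """Angle between two inputs in the feature space of a kernel."""
    cos_theta = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    return np.arccos(cos_theta)

def k_step(x, y):
    """Degree-0 arc-cosine kernel (step transfer function): 1 - theta / pi."""
    theta = _angle(x @ y, x @ x, y @ y)
    return 1.0 - theta / np.pi

def relu_layer(kxy, kxx, kyy):
    """One degree-1 (ReLU) composition step applied to kernel values."""
    theta = _angle(kxy, kxx, kyy)
    j1 = np.sin(theta) + (np.pi - theta) * np.cos(theta)
    return np.sqrt(kxx * kyy) * j1 / np.pi

def k_relu(x, y, depth=1):
    """Degree-1 arc-cosine kernel, composed `depth` times."""
    kxy, kxx, kyy = x @ y, x @ x, y @ y
    for _ in range(depth):
        # For degree 1 the diagonal entries k(x, x) are unchanged by a layer,
        # so only the off-diagonal value needs to be updated.
        kxy = relu_layer(kxy, kxx, kyy)
    return kxy

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)

# Monte Carlo check of the single-layer forms against
# 2 * E_w[f(w.x) f(w.y)] with w ~ N(0, I), where f is the step or ReLU function.
W = rng.normal(size=(200000, 4))
px, py = W @ x, W @ y
mc_step = 2.0 * np.mean((px > 0) & (py > 0))
mc_relu = 2.0 * np.mean(np.maximum(px, 0.0) * np.maximum(py, 0.0))

print("step:", k_step(x, y), "Monte Carlo:", mc_step)
print("ReLU:", k_relu(x, y), "Monte Carlo:", mc_relu)
print("ReLU, three composed layers:", k_relu(x, y, depth=3))
```

Composition here only re-applies the closed form to the previous layer's kernel
values, which is how a deep architecture is simulated without sampling any weights.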

In our work we provide an unbiased estimate of our kernels in the second layer using
Bochner's theorem. These kernels consider a shift-invariant covariance function. [17]
suggest the same estimator for kernel functions in the context of random features in
kernel methods. [12] suggested improved methods to reduce the variance of these
estimates. Such estimators were recently used within kernel methods to match deep
learning results in language processing [10, 13].
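
As a minimal sketch of a Bochner-based estimator of this kind (not the paper's
implementation), the following Python code estimates a shift-invariant kernel from
random frequencies drawn from its spectral density. The choice of a Gaussian kernel,
the bandwidth `sigma`, the number of samples `D` and all function names are
assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Exact shift-invariant Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def random_fourier_features(X, W):
    """Map rows of X to features whose inner products estimate the kernel.

    W has shape (D, d); each row is a frequency drawn from the kernel's
    spectral density. The feature map is [cos(Wx), sin(Wx)] / sqrt(D), so that
    z(x) . z(y) = (1 / D) * sum_k cos(w_k . (x - y)), an unbiased estimate.
    """
    proj = X @ W.T
    D = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D, sigma = 5, 20000, 1.5

# By Bochner's theorem the Gaussian kernel is the Fourier transform of a
# Gaussian density, here N(0, I / sigma^2).
W = rng.normal(scale=1.0 / sigma, size=(D, d))

x = rng.normal(size=d)
y = rng.normal(size=d)
Z = random_fourier_features(np.vstack([x, y]), W)

estimate = float(Z[0] @ Z[1])
exact = float(gaussian_kernel(x, y, sigma))
print("estimate:", estimate, "exact:", exact)
```

The error of `estimate` shrinks at the usual Monte Carlo rate of order 1/sqrt(D), so
larger `D` trades computation for a lower-variance kernel estimate.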


7 Discussion 

Deep neural networks are successful in machine learning applications although the
number of their parameters is orders of magnitude larger than the number of training
examples. In this work we explain this behavior using deep infinite neural networks.
We construct stochastic kernels that rely on Gaussian processes to encode such
networks. We also explain how to introduce locality and non-linearity to such networks,
similar to those introduced by convolutional neural networks. Lastly, we provide
generalization bounds and regularity conditions that explain why these networks do
not overfit. We present our framework with only two intermediate layers mainly for
simplicity. It can be extended to any depth but the higher layers may not use
non-linearities. The problem of finding analytic forms of stochastic kernels that encode
arbitrarily deep layers with non-linearities is largely open.

The work combines mostly separate areas in machine learning, including kernel
methods, neural networks and Gaussian processes. As such, there are many directions
that still need to be explored. Importantly, which non-linearities are significant in
deep infinite networks, and can they be learned from data? Which probability measures
best fit this framework, and are there other properties of stochastic processes, besides
covariance, that control learning?


References 

[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends® in
Machine Learning, 2(1):1-127, 2009.

[2] Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, and Patrice
Marcotte. Convex neural networks. In NIPS, pages 123-130, 2005.

[3] Olivier Bousquet and Andre Elisseeff. Stability and generalization. The Journal 
of Machine Learning Research, 2:499-526, 2002. 

[4] Chih-Chieh Cheng and Brian Kingsbury. Arccosine kernels: Acoustic modeling 
with infinite neural networks. In ICASSP, pages 5200-5203, 2011. 

[5] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In 
Advances in neural information processing systems, pages 342-350, 2009. 

[6] Youngmin Cho and Lawrence K Saul. Large-margin classification in infinite
neural networks. Neural Computation, 22(10):2678-2697, 2010.


[7] M. Collins, A. Globerson, T. Koo, X. Carreras, and P.L. Bartlett. Exponentiated
gradient algorithms for conditional random fields and max-margin Markov
networks. The Journal of Machine Learning Research, 9:1775-1822, 2008.

[8] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram
Singer. Online passive-aggressive algorithms. The Journal of Machine Learning
Research, 7:551-585, 2006.

[9] Kurt Hornik. Some new results on neural network approximation. Neural
Networks, 6(8):1069-1072, 1993.

[10] Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana
Ramabhadran. Kernel methods match deep neural networks on TIMIT. In ICASSP,
pages 205-209, 2014.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification
with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.

[12] Quoc Le, Tamas Sarlos, and Alex Smola. Fastfood - approximating kernel
expansions in loglinear time. In Proceedings of the International Conference on
Machine Learning, 2013.

[13] Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, 
Aurelien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, 
et al. How to scale up kernel methods to be as good as deep neural nets. arXiv 
preprint arXiv:1411.4000, 2014. 

[14] Laurens Maaten, Minmin Chen, Stephen Tyree, and Kilian Q Weinberger. Learning
with marginalized corrupted features. In ICML, pages 410-418, 2013.

[15] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning
theory: stability is sufficient for generalization and necessary and sufficient for
consistency of empirical risk minimization. Advances in Computational
Mathematics, 25(1-3):161-193, 2006.

[16] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University
of Toronto, 1995.

[17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel
machines. In NIPS, pages 1177-1184, 2007.

[18] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks:
Replacing minimization with randomization in learning. In NIPS, pages 1313-1320,
2009.

[19] Carl Edward Rasmussen. Gaussian processes for machine learning. 2006. 

[20] Nicolas Le Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient
method with an exponential convergence rate for finite training sets. In NIPS,
pages 2663-2671, 2012.


[21] Bernhard Scholkopf, Christopher JC Burges, and Alexander J Smola. Advances 
in kernel methods: support vector learning. MIT press, 1999. 

[22] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan.
Learnability, stability and uniform convergence. JMLR, 11:2635-2670, 2010.

[23] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive
regularization. In NIPS, pages 351-359, 2013.

[24] Christopher KI Williams. Computing with infinite networks. Advances in neural 
information processing systems, pages 295-301, 1997. 

